Influenza A viruses are notorious and exemplary pathogens that cross host-species boundaries. By no means exclusively, yet the potential for IAV cross-species emergence is one of the most keenly felt biological threats in the world. The proclivity of these viruses to circulate in multiple vertebrate species, including many of the species reared agriculturally, poses great importance on understanding how and when host-jumps may occur. In addition, huge amounts of resources are spent annually to try and predict whether IAV will jump between species, and what those consequences may entail. To make these predictions, many different angles of influenza virus biology are analysed from evolutionary history to ecological contact networks to structural/molecular modelling of host-pathogen protein interactions to name but a few. However, exploring these features independently can obfuscate patterns and/or lead to pseudo-replication - is he relationship between protein structures already accounted for by viral phylogenetic histories?
Our cross-specialisation approach first uses protein compositional features to train machine learning models on classifying the original host the virus was obtained from. - The features are x and show y, included them because z
Finally, we compared the results of both modelling and phylogenetic approaches to detect any overlap or, more interestingly, any inferences that were exclusive to only one method.
Sequences were collected from NCBI Virus Database. This dataset was filtered to only include samples with: a) complete nucleotide sequences, missing no more than 10% of the average length for that segment b) metadata containing the virus subtype and original host, at least up to the Class level (Aves or Mammalia) through preferably detailing all the way down to the Family or Species level Nucleotide sequences were then assigned a random identifier (between AAA-000-AAA and ZZZ-999-ZZZ) to associate with metadata throughout analyses. 14,682 avian and 21,737 mammalian sequences were collated at this stage.
Samples from mammalian hosts were then retained only if the original host was within the Canidae, Suidae, Equidae, Hominidae or Phyllostomidae families, as viruses within these hosts were expected to display clear examples of adaptation. Viral sequences of mammalian-origin were then reduced further by excluding samples where the subtype does not circulate naturally within closed systems of that host (i.e. removing spillovers or stuttering chains of transmission, neither of which represent the kind of host adaptation we sought to detect). Included subtypes for each family: Canidae (H3N2, H3N8), Equidae (H7N7, H3N8) and finally Hominidae (H1N1, H1N2, H2N2 and H3N2).
Corresponding protein sequences were then obtained for each sample, most of which were available from the NCBI Virus Database. The few genomic sequences without matching proteins were translated locally using the ‘ape’ package in R (Paradis and Schliep 2019).
Proteins were then aligned using MAFFT (Rozewicki et al. 2019) into 20 resulting protein fasta files (two for each of the 10 major IAV proteins, one containing mammalian-origin viruses and the other avian-origin). These were then clustered using the linear algorithm ‘easy-linclust’ within the tool MMSeq2 (Steinegger and Söding 2018). Three variables of the clustering algorithm were experimented with in optimising data downsampling: the percentage identity threshold required to delineate sequences from one another (–min-seq-id 0.5, 0.75, 0.85, 0.9, 0.95, 0.99), coverage of the sequences within each cluster (-c 0.5, 0.7, 0.8) and the way in which the coverage is calculated (–cov-mode 0, 1). Ultimately options were set to --min-seq-id 0.95, -c 0.8 and --cov-mode 1 in order to grant a representative number of samples without introducing redundancy. Ultimately, 6697 proteins were nominated leading to sequences from 4551 individual viruses.
Proteins from avian-origin and mammalian-origin viruses retained by MMSeq2 were then aligned into a single fasta file for each protein. Accessory proteins (PB1-F2 and PA-X) were then removed from the analyses as both lie within the coding regions of their respective genomic segments and thus the calculation of any features from sequence data would already include these regions. Similarly, those spliced from alternate reading frames (M2 and NEP) were excluded from any further analyses as they were too short and/or conserved to show patterns of adaptation
OmegaFeaturesCLI (Chen et al. 2022) was run over the alignments to produce matrices of protein compositional and physio-chemical properties with the ‘get_descriptor’ command. Features, and the associated command, are as follows: 2mers (‘DPC type 1’), Chou’s pseudo-amino acid composition (‘PseAAC’), conjoined triad (‘CTriad’), and finally Composition (CTDc), & Transition (CTDt) & Distribution (CTDd) values.
Six datasets of protein features were derived from the iFeatureOmega processing: amino acid two-mers (DPC, 400 features), (CTDc, CTDd and CTDt, 39, 195 and 39 features respectively), paac (23 features) and triad (343 features). These were then annotated with the subtype of the origin virus and labelled with the original host.
Matrices of protein features were then used to create Random Forest classification models, to distinguish between the host from which viruses were sampled.
In order to create test data, subtypes that appeared most frequently (>20 times) were iteratively held out of model construction. In addition to these 23 subtypes, both bat-exclusive subtypes (H17N10 and H18N11) were held as test data. The remaining 70 subtypes comprised \(\frac{308}{4551}\) (~7%) sequences. Six Random Forest models were created from each of the feature datasets, for each holdout. These models were then stacked with the use of the ‘caretEnsemble’ package (Deane-Mayer 2024). Models were weighted in order to accommodate for disproportionate sizes of host groups by using the inverse of that group’s proportion.
These datasets were then used to train Random Forest models, leading to 2400 models in total: 8 proteins, 6 chemical and compositional features, 25 subtype holdouts and either two-class or multi-class models. Each models’ training statistics were extracted for preliminary validation. Models for each protein and subtype were then compiled with the stacking algorithm by caret.
| Protein | Model |
|---|---|
| PB2 | FLU+R4 |
| PB1 | FLU+R4 |
| PA | FLU+R5 |
| HA | FLU+R8 |
| NP | Q.mammal+R4 |
| NA | FLU+R5 |
| M1 | Q.mammal+R3 |
| NS1 | HIVw+R6 |
Maximum Likelihood trees were estimated with substitution models suggested by ModelFinder (Table X) and bootstrapped 1000 times for validation.
Phylogenetic models were first trialled on Maximum Likelihood trees with the use of BayesTraits v3.0 (Meade and Pagel 2022); in the R wrapper “btw” (Griffin 2018). A discrete-trait analysis estimated the origin host of viruses and protein features were modelled as continuous traits. Trait analyses were first modelled using tips features and then ancestral trait reconstruction upon internal nodes, both using multi-state reverse-jump MCMC procedures. Initially, transition rates between discrete states (i.e. Host) were examined across each protein tree before the states of ancestral nodes were estimated.
Specific nodes were selected for ancestral trait reconstruction due to known historical host-switching events. Recognised host switch events include: human viruses that originated from avian and swine reassortments in 1918, 1957 and 1968; emergence of swine viruses into human pandemic strains (H1N1pdm09); avian IAV into horses (H7N7 and H3N8) and the jump of equine H3N8 viruses into canine populations.
RateAncestor: this variable can be set to 0 or 1. If RateAncestor = 1, the program is forced to do two additional analyses, which you can ignore if you do not need the results. First, under a model of variable rates across sites such as the gamma, RateAncestor = 1 forces the program to calculate rates for individual sites along the sequence (output that will be saved in the output rates file), using the empirical Bayes procedure (Yang and Wang 1995). Secondly, RateAncestor = 1 forces the program to perform the empirical Bayesian reconstruction of ancestral sequences (Yang et al. 1995b; Koshi and Goldstein 1996; [Yang 2006 pages 119-124(http://abacus.gene.ucl.ac.uk/CME/)]). Ancestral state reconstruction by parsimony (Fitch 1971; Hartigan 1973) is well known (i.e., it is implemented in the PAML program pamp). It can also be achieved using the likelihood/empirical Bayes approach. Often, the two approaches produce similar results, but the likelihood-based reconstruction has two advantages over parsimony: this method uses information from the branch lengths and the relative substitution rates between characters (nucleotides) and provides a measure of uncertainties in the form of posterior probabilities for reconstructed ancestral states.
The results are listed, by site, in the output file rst. You can also use the variable verbose to control how much information you want to be written in this output file. If verbose = 0, only the best nucleotide (the one with the highest posterior probability) at each node at each site is listed, while with verbose = 1 (try 2 if 1 does not work), the full posterior probability distribution from the marginal reconstruction is listed. If the model is homogenous (i.e., if nhomo = 0 or nhomo = 1 have been specified in the control file) and assumes one rate for all sites, both the joint and marginal ancestral reconstructions will be calculated. If the model assumes variable rates among sites like the gamma model, only the marginal reconstructions are calculated. More details about ancestral sequence reconstruction in the section below
| Protein | Avian | Mammalian |
|---|---|---|
| PB2 | 338 | 561 |
| PB1 | 303 | 255 |
| PA | 282 | 415 |
| HA | 1406 | 919 |
| NP | 215 | 244 |
| NA | 185 | 207 |
| M1 | 73 | 149 |
| NS1 | 539 | 606 |
The clustering resulting from removing sequences with a 95% sequence identity left us with 6,697 protein sequences overall, representing 4551 individual virus genomes. To assist with building the model architecture, any virus genome that had at least one representative protein was analysed fully - hence 4551 samples for each of the 8 proteins analysed, despite the high sequence homology between many of them. The number of representative proteins called by the clustering algorithm is shown in Table X and full lists of the associated accession codes are available in Supplementary X.
Confusion matrices made between known origin host and those predicted by the stack models were used to assess model predictive strength and find evidence of possible host shifts. Counter-intuitively, a lower model accuracy implied greater evidence of adaptation to the host species; the mislabelling of proteins is indicative of host-specific adaptations to non-native hosts. For example, should a human-origin virus be incorrectly labelled as swine-origin by the model, this supports the hypothesis that the protein retains some signal of adaptation to swine hosts. Overall, confusion matrices were evaluated based on accuracy and F1 statistics (shown in Table Y).
| Protein | F1.micro | F1.macro | AUC |
|---|---|---|---|
| PB2 | 0.8431141 | 0.5176765 | 0.570 |
| PB1 | 0.8074284 | 0.3840304 | 0.527 |
| PA | 0.8750299 | 0.5749008 | 0.575 |
| HA | 0.7699953 | 0.2947088 | 0.558 |
| NP | 0.8840125 | 0.4957010 | 0.556 |
| NA | 0.7349682 | 0.3321250 | 0.527 |
| M1 | 0.8632233 | 0.5168716 | 0.559 |
| NS1 | 0.9098030 | 0.5188600 | 0.565 |
| Protein | Matched | Mismatched |
|---|---|---|
| 01PB2 | 404 | 145 |
| 02PB1 | 375 | 179 |
| 03PA | 573 | 118 |
| 04HA | 1760 | 240 |
| 05NP | 219 | 173 |
| 06NA | 316 | 76 |
| 07M1 | 145 | 76 |
| 08NS1 | 1008 | 90 |
When did the model predict things incorrectly? Was there any pattern to the mistakes? Not really…
ML trees Trees were first estimated with IQTree2
Transition rates between hosts, based on entire trees when the ONLY feature input to the phylogenetic model is host. i.e. given the distribution of hosts throughout the tree, what is the likelihood that any given branch will transition from host_1 to host_2?
Two new columns will be added to the output “RecNode P(D)” and “RecNode P(G)”, these represent the probability of reconstructing a D or a G at RecNode.
## [1] 3493 1503